An Improved Random Forest Classifier for Text Categorization
نویسندگان
چکیده
This paper proposes an improved random forest algorithm for classifying text data. This algorithm is particularly designed for analyzing very high dimensional data with multiple classes whose well-known representative data is text corpus. A novel feature weighting method and tree selection method are developed and synergistically served for making random forest framework well suited to categorize text documents with dozens of topics. With the new feature weighting method for subspace sampling and tree selection method, we can effectively reduce subspace size and improve classification performance without increasing error bound. We apply the proposed method on six text data sets with diverse characteristics. The results have demonstrated that this improved random forests outperformed the popular text classification methods in terms of classification performance. Index Terms — random forest, text categorization, random subspace, decision tree
منابع مشابه
Semi-Supervised Learning Based Prediction of Musculoskeletal Disorder Risk
This study explores a semi-supervised classification approach using random forest as a base classifier to classify the low-back disorders (LBDs) risk associated with the industrial jobs. Semi-supervised classification approach uses unlabeled data together with the small number of labelled data to create a better classifier. The results obtained by the proposed approach are compared with those o...
متن کاملText categorization on Reuters corpus
1 Task The task of text categorization can be described as follows: given a set of documents, we want to assign to each document one or more text categories or no category. In this term project, we want categorize documents from the well-known Reuters-21578 corpus which is a collection of 21578 articles published on Reuters in 1987. We have chosen only three most frequent text categories as the...
متن کاملAutomatic Categorization of Fanatic Texts
This paper presents a task of automatic categorization of fanatic texts. The analyzed set of texts stems from an Arabic environment in Kuwait, where teachers and students were asked questions regarding various terrorist tendencies. The responses were classified by a domain expert into one of three classes with respect to degree of fanaticism of their content. The main task was to develop an aut...
متن کاملForesTexter: An efficient random forest algorithm for imbalanced text categorization
In this paper, we propose a new Random Forest (RF) based ensemble method, ForesTexter, to solve the imbalanced text categorization problems. RF has shown great success in many real-world applications. However, the problem of learning from text data with class imbalance is a relatively new challenge that needs to be addressed. A RF algorithm tends to use a simple random sampling of features in b...
متن کاملAn Improved Algorithm of Bayesian Text Categorization
Text categorization is a fundamental methodology of text mining and a hot topic of the research of data mining and web mining in recent years. It plays an important role in building traditional information retrieval, web indexing architecture, Web information retrieval, and so on. This paper presents an improved algorithm of text categorization that combines the feature weighting technique with...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- JCP
دوره 7 شماره
صفحات -
تاریخ انتشار 2012